5 research outputs found

A methodology for selective protection of matrix multiplications: A diagnostic coverage and performance trade-off for CNNs executed on GPUs

Author: Abella Ferrer Jaume
Agirre Troncoso Irune
Cazorla Almeida Francisco Javier
Fernández Muñoz Javier
Pérez Cerrolaza Jon
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

The ability of CNNs to efficiently and accurately perform complex functions, such as object detection, has fostered their adoption in safety-related autonomous systems. These algorithms require high computational performance platforms that exploit high levels of parallelism. The detection, control and mitigation of random errors in these underlying high computational platforms become a must according to functional safety standards. In this paper, we propose protecting, with a catalog of diagnostic techniques, the most computationally expensive operation of the CNNs, the matrix multiplication. However, this protection entails a performance penalty, and the complete CNN protection may be unaffordable for those systems operating with strict real-time constraints. This paper proposes a three-stage methodology to selectively protect CNN layers to achieve the required diagnostic coverage and performance trade-off: i) sensitivity analysis to misclassification per CNN layers using a statistical fault injection campaign, ii) layer-by-layer performance impact and diagnostic coverage analysis, and iii) selective layer protection. Furthermore, we propose a strategy to effectively compute the achievable diagnostic coverage of large matrices implemented on GPUs. Finally, we apply the proposed methodology and strategy in Tiny YOLO-v3, an object detector based on CNNs.

Ikerlan authors have received funding from Elkartek grant project KK-2021/00123 of the Basque government. BSC authors have been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GBC21/AEI/10.13039/501100011033.

Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

GPU devices for safety-critical systems: a survey

Author: Abella Ferrer Jaume
Calderón Torres Alejandro Josué
Cazorla Almeida Francisco Javier
Flores Barroso José Luis
Kosmidis Leonidas
Pérez Cerrolaza Jon
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/07/2023

Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel, and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures, and independence of execution.

This work has been partially supported by the European Research Council with Horizon 2020 (grant agreements No. 772773 and 871465), the Spanish Ministry of Science and Innovation under grant PID2019-107255GB, the HiPEAC Network of Excellence and the Basque Government under grant KK-2019-00035. The Spanish Ministry of Economy and Competitiveness has also partially supported Leonidas Kosmidis with a Juan de la Cierva Incorporación postdoctoral fellowship (FJCI-2020-045931-I).

Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

On the safe deployment of matrix multiplication in massively parallel safety-related systems

Author: Abella Ferrer Jaume
Agirre Irune
Calderón Torres Alejandro Josué
Cazorla Almeida Francisco Javier
Fernández Muñoz Javier
Pérez Cerrolaza Jon
Publication venue: MDPI AG
Publication date: 01/04/2022

Deep learning technology has enabled the development of increasingly complex safety-related autonomous systems using high-performance computers, such as graphics processing units (GPUs), which provide the required high computing performance for the execution of parallel computing algorithms, such as matrix–matrix multiplications (a central computing element of deep learning software libraries). However, the safety certification of parallel computing software algorithms and GPU-based safety-related systems is a challenge to be addressed. For example, achieving the required fault-tolerance and diagnostic coverage for random hardware errors. This paper contributes with a safe matrix–matrix multiplication software implementation for GPUs with random hardware error-detection capabilities (permanent, transient) that can be used with different architectural patterns for fault-tolerance, and which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The proposed contribution is complementary and can be combined with other techniques, such as algorithm-based fault tolerance. In particular, (i) we provide the high-performance matrix multiplication CUTLASS library with a catalog of diagnostic mechanisms to detect random hardware errors down to the arithmetic operation level; and (ii) we measure the performance impact incurred by the adoption of these mechanisms and their achievable diagnostic coverage with a set of representative matrix dimensions. To that end, we implement these algebraic operations, targeting CUDA cores with single instructions and multiple-thread math instructions in an NVIDIA Xavier NX GPU.

The research of this paper has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement No 871465 (UP2DATE)).

Peer Reviewed. Postprint (published version)

Multidisciplinary Digital Publishing Institute

UPCommons. Portal del coneixement obert de la UPC

Directory of Open Access Journals

Towards functional safety compliance of matrix–matrix multiplication for machine learning-based autonomous systems

Author: Abella Ferrer Jaume
Agirre Troncoso Irune
Allende Imanol
Cazorla Almeida Francisco Javier
Fernández Muñoz Javier
Pérez Cerrolaza Jon
Publication venue: Elsevier BV
Publication date: 01/12/2021

Autonomous systems execute complex tasks to perceive the environment and take self-aware decisions with limited human interaction. This autonomy is commonly achieved with the support of machine learning algorithms. The nature of these algorithms, that need to process large data volumes, poses high-performance demands on the underlying hardware. As a result, the embedded critical real-time domain is adopting increasingly powerful processors that combine multi-core processors with accelerators such as GPUs. The resulting hardware and software complexity makes it difficult to demonstrate that the system will run safely and reliably. This is the main objective of functional safety standards, such as IEC 61508 or ISO 26262, that deal with the avoidance, detection and control of hardware or software errors. In this paper, we adopt those measures for the safe inference of machine learning libraries on multi-core devices, two topics that are not explicitly covered in the current version of standards. To this end, we adapt the matrix-matrix multiplication function, a central element of existing machine learning libraries, according to the recommendations of functional safety standards. The paper makes the following contributions: (i) adoption of recommended programming practices for the avoidance of programming errors in the matrix-matrix multiplication, (ii) inclusion of diagnostic mechanisms based on widely used checksums to control runtime errors, and (iii) evaluation of the impact of previous measures in terms of performance and a quantification of the achieved diagnostic coverage. For this purpose, we implement the diagnostic mechanisms on one of the ARM R5 cores of a Zynq UltraScale+ multi-processor system-on-chip and we then adapt them to an Intel i7 processor with native code employing vectorization for the sake of performance.

Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

SAFEPOWER project: Architecture for safe and power-efficient mixed-criticality systems

Author: Adele Maleki
Alfons Crespo
Alina Lenz
Ingemar Söderquist
Ingo Sander
Javier Coronel
Johnny Öberg
Jon Pérez-Cerrolaza
Juan Carlos Diaz Garcia
Kim Grüttner
Maher Fakih
Mikel Azkarate-Askasua
Mohamed Tagelsir Mohammadat
Nera González Romero
Razi Seyyedi
Roman Obermaisser
Simon Davidmann
Sören Schreiner
Publication venue: Elsevier BV

Crossref